Scenario 1: sampling minibatches from fully observed datasets

To perform online iNMF, we need to install the latest Liger package from GitHub. Please see the instruction below.

library(devtools)
install_github("welch-lab/liger")

We first create a Liger object by passing the filenames of HDF5 files containing the raw count data. The data can be downloaded here. Liger assumes by default that the HDF5 files are formatted by the 10X CellRanger pipeline. Large datasets are often generated over multiple 10X runs (for example, multiple biological replicates). In such cases it may be necessary to merge the HDF5 files from each run into a single HDF5 file. We provide the mergeH5 function for this purpose (see below for details).

library(liger)
pbmcs = createLiger(list(stim = "pbmcs_stim.h5", ctrl = "pbmcs_ctrl.h5"))

We then perform the normalization, gene selection, and gene scaling in an online fashion, reading the data from disk in small batches.

pbmcs = normalize(pbmcs)
pbmcs = selectGenes(pbmcs, var.thresh = 0.2, do.plot = F)
pbmcs = scaleNotCenter(pbmcs)

Online Integrative Nonnegative Matrix Factorization

Now we can use online iNMF to factorize the data, again using only minibatches that we read from the HDF5 files on demand (default mini-batch size = 5000). Sufficient number of iterations is crucial for obtaining ideal factorization result. If the size of the mini-batch is set to be close to the size of the whole dataset (i.e. an epoch only contains one iteration), max.epochs needs to be increased accordingly for more iterations.

pbmcs = online_iNMF(pbmcs, k = 20, miniBatch_size = 5000, max.epochs = 5)

Quantile Normalization and Downstream Analysis

After performing the factorization, we can perform quantile normalization to align the datasets.

pbmcs = quantile_norm(pbmcs)

We can also visualize the cell factor loadings in two dimensions using t-SNE or UMAP.

pbmcs = runUMAP(pbmcs)
plotByDatasetAndCluster(pbmcs, axis.labels = c("UMAP1","UMAP2"))

Let’s first evaluate the quality of data alignment. The alignment score ranges from 0 (no alignment) to 1 (perfect alignment).

calcAlignment(pbmcs)
## [1] 0.9149238

With HDF5 files as input, we need to sample the raw, normalized, or scaled data from the full dataset on disk and load them in memory. Some plotting functions and downstream analyses are designed to operate on a subset of cells sampled from the full dataset. This enables rapid analysis using limited memory. The readSubset function allows either uniform random sampling or sampling balanced by cluster. Here we extract the normalized count data of 5000 sampled cells.

pbmcs = readSubset(pbmcs, slot.use = "norm.data", max.cells = 5000)

Using the sampled data stored in memory, we can now compare clusters or datasets (within each cluster) to identify differentially expressed genes. The runWilcoxon function performs differential expression analysis by performing an in-memory Wilcoxon rank-sum test on this subset. Thus, users can still analyze large datasets with a fixed amount of memory.

de_genes = runWilcoxon(pbmcs, compare.method = "datasets")

Here we show the top 10 genes in cluster 1 whose expression level significantly differ between two dataset.

de_genes = de_genes[order(de_genes$padj), ]
head(de_genes[de_genes$group == "1-stim",], n = 10)
##       feature  group    avgExpr     logFC statistic       auc          pval
## 14061   ISG15 1-stim  -5.430433 13.820208  121239.0 0.9898839 6.419043e-114
## 14324    IFI6 1-stim  -6.843238 14.763583  118624.5 0.9685372 2.942152e-108
## 23989   ISG20 1-stim  -5.779989  9.286306  117816.5 0.9619401  4.421875e-99
## 21242   IFIT3 1-stim  -8.320794 14.539736  114625.0 0.9358824  1.133535e-98
## 21244   IFIT1 1-stim  -9.094809 13.669177  112456.0 0.9181731  2.367054e-92
## 27532     MX1 1-stim  -8.935794 12.875135  111536.5 0.9106656  5.884017e-87
## 20364    LY6E 1-stim  -8.721669 12.810795  111435.0 0.9098369  9.506436e-86
## 23754     B2M 1-stim  -3.188800  0.461932  108547.5 0.8862612  3.907049e-69
## 21241   IFIT2 1-stim -11.928663 10.782880  102299.0 0.8352439  3.361002e-66
## 14634  IFI44L 1-stim -12.345872  9.720498   98779.5 0.8065081  3.120570e-55
##                padj pct_in pct_out
## 14061 9.020681e-110    100     100
## 14324 2.067303e-104    100     100
## 23989  2.071354e-95    100     100
## 21242  3.982393e-95    100     100
## 21244  6.652841e-89    100     100
## 27532  1.378135e-83    100     100
## 20364  1.908485e-82    100     100
## 23754  6.863219e-66    100     100
## 21241  5.248019e-63    100     100
## 14634  4.385337e-52    100     100

We can show UMAP coordinates of sampled cells by their loadings on each factor (Factor 1 as an example). Underneath it displays the most highly loading shared and dataset-specific genes, with the size of the marker indicating the magnitude of the loading.

p_wordClouds = plotWordClouds(pbmcs, num.genes = 5, return.plots = T)
p_wordClouds[[1]]

We can generate plots of dimensional reduction coordinates colored by expression of specified gene. The first two UMAP dimensions and gene ISG15 (identified by Wilcoxon test in the previous step) is used as an example here.

plotGene(pbmcs, "ISG15", return.plots = F)

Furthermore, we can make violin plots of expression of specified gene for each dataset (ISG15 as an example).

plotGeneViolin(pbmcs, "ISG15", return.plots = F)

The online algorithm can be implemented on datasets loaded in memory as well. The same analysis is performed on the PBMCs, shown below.

stim = readRDS("pbmcs_stim.RDS")
ctrl = readRDS("pbmcs_ctrl.RDS")
pbmcs_mem = createLiger(list(stim = stim, ctrl = ctrl), remove.missing = F)
pbmcs_mem = normalize(pbmcs_mem)
pbmcs_mem = selectGenes(pbmcs_mem, var.thresh = 0.2, do.plot = F)
pbmcs_mem = scaleNotCenter(pbmcs_mem)
pbmcs_mem = online_iNMF(pbmcs_mem, k = 20, miniBatch_size = 5000, max.epochs = 5)
pbmcs_mem = quantile_norm(pbmcs_mem)
pbmcs_mem = runUMAP(pbmcs_mem)
plotByDatasetAndCluster(pbmcs_mem, axis.labels = c("UMAP1","UMAP2"))

As mentioned above, it is sometimes necessary to merge multiple HDF5 files (such as multiple 10X runs from the same tissue or condition) into a single file. We provide the mergeH5 function for this purpose. The function takes as input a list of filenames to merge (file.list), a vector of sample names that are prepended to the cell barcodes (library.names), and the name of the merged HDF5 file. The function requires that all files to be merged include exactly the same set of genes. For example, we can merge the cells and nuclei datasets used in the examples below (note that merging these particular two datasets is not something you would like to do, and is purely to demonstrate the mergeH5 function).

mergeH5(file.list = list("allen_smarter_cells.h5", "allen_smarter_nuclei.h5"), 
        library.names = c("cells","nuclei"),
        new.filename = "cells_nuclei")

Scenario 2: iterative refinement by incorporating new datasets

We can also perform online iNMF with continually arriving datasets.

MOp = createLiger(list(cells = "allen_smarter_cells.h5"))
MOp = normalize(MOp)
MOp = selectGenes(MOp, var.thresh = 2)
MOp.vargenes = MOp@var.genes
MOp = scaleNotCenter(MOp)
MOp = online_iNMF(MOp, k = 40, max.epochs = 1)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))

MOp2 = createLiger(list(nuclei = "allen_smarter_nuclei.h5"))
MOp2 = normalize(MOp2)
MOp2@var.genes = MOp@var.genes
MOp2 = scaleNotCenter(MOp2)
MOp = online_iNMF(MOp, X_new = list(nuclei = MOp2), k = 40, max.epochs = 1)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))

Scenario 3: projecting new datasets

MOp = createLiger(list(cells = "allen_smarter_cells.h5"))
MOp@var.genes = MOp.vargenes
MOp = online_iNMF(MOp, k = 40, max.epochs = 1)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))

MOp = online_iNMF(MOp, X_new = list(nuclei = MOp2), k = 40, project = TRUE)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))